Unifying Visual-Semantic Embeddings with Multimodal Neural Language Models
Authors
Ryan Kiros, Ruslan Salakhutdinov, Richard S. Zemel
Abstract
Inspired by recent advances in multimodal learning and machine translation, we introduce an encoder-decoder pipeline that learns (a) a multimodal joint embedding space for images and text and (b) a novel language model for decoding distributed representations from our space. Our pipeline effectively unifies joint image-text embedding models with multimodal neural language models. We introduce the structure-content neural language model, which disentangles the structure of a sentence from its content, conditioned on representations produced by the encoder. The encoder allows one to rank images and sentences, while the decoder can generate novel descriptions from scratch. Using an LSTM to encode sentences, we match the state-of-the-art performance on Flickr8K and Flickr30K without using object detections. We also set new best results when using the 19-layer Oxford convolutional network. Furthermore, we show that with linear encoders, the learned embedding space captures multimodal regularities in terms of vector space arithmetic, e.g. *image of a blue car* - "blue" + "red" is near images of red cars. Sample captions generated for 800 images are made available for comparison.
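As a rough illustration of the two operations the abstract attributes to the joint embedding space (ranking images and sentences by similarity, and vector arithmetic such as *image of a blue car* - "blue" + "red"), the sketch below uses random linear projections in place of the trained CNN image encoder and LSTM sentence encoder. The dimensions and the helpers embed_image, embed_text, rank_images, and multimodal_analogy are hypothetical stand-ins, not the authors' implementation.

```python
# Minimal sketch of similarity ranking and vector arithmetic in a joint
# image-text embedding space. Random projections stand in for trained encoders.
import numpy as np

rng = np.random.default_rng(0)
D_IMG, D_WORD, D_JOINT = 4096, 300, 1024            # illustrative dimensionalities

W_img = 0.01 * rng.standard_normal((D_JOINT, D_IMG))   # image features -> joint space
W_txt = 0.01 * rng.standard_normal((D_JOINT, D_WORD))  # word/sentence vectors -> joint space

def unit(v):
    return v / (np.linalg.norm(v) + 1e-8)

def embed_image(cnn_features):
    # Project image features (e.g. CNN activations) into the joint space and L2-normalize.
    return unit(W_img @ cnn_features)

def embed_text(word_vectors):
    # Sum word vectors as a stand-in for an LSTM or linear sentence encoder, then project.
    return unit(W_txt @ np.sum(word_vectors, axis=0))

def rank_images(sentence_words, image_features):
    # Higher cosine similarity = better match; this is the image-retrieval direction
    # of the Flickr8K/Flickr30K ranking tasks.
    q = embed_text(sentence_words)
    scores = np.array([q @ embed_image(f) for f in image_features])
    return np.argsort(-scores)                       # indices, best match first

def multimodal_analogy(image_features, minus_word, plus_word):
    # e.g. blue-car image - "blue" + "red" -> a query vector near red-car images.
    return unit(embed_image(image_features) - embed_text([minus_word]) + embed_text([plus_word]))

# Example: rank three random "images" against a two-word "sentence".
images = [rng.standard_normal(D_IMG) for _ in range(3)]
sentence = [rng.standard_normal(D_WORD) for _ in range(2)]
print(rank_images(sentence, images))
```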
Similar resources
Incorporating visual features into word embeddings: A bimodal autoencoder-based approach
Multimodal semantic representation is an evolving area of research in natural language processing as well as computer vision. Combining or integrating perceptual information, such as visual features, with linguistic features has recently been actively studied. This paper presents a novel bimodal autoencoder model for multimodal representation learning: the autoencoder learns in order to enhance...
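For readers unfamiliar with the setup, a generic bimodal autoencoder of the kind this excerpt describes can be sketched as follows; the class and layer sizes below are hypothetical and not the paper's exact architecture. Textual and visual feature vectors are fused into a shared hidden code, and reconstructing both modalities from that code is what drives the multimodal representation.

```python
# Generic bimodal autoencoder sketch (hypothetical sizes, not the paper's model):
# encode concatenated text+visual features into a shared code, decode both back.
import torch
import torch.nn as nn

class BimodalAutoencoder(nn.Module):
    def __init__(self, text_dim=300, visual_dim=512, hidden_dim=256):
        super().__init__()
        # Joint encoder over the concatenated modalities.
        self.encoder = nn.Sequential(nn.Linear(text_dim + visual_dim, hidden_dim), nn.ReLU())
        # Separate decoders reconstruct each modality from the shared code.
        self.decode_text = nn.Linear(hidden_dim, text_dim)
        self.decode_visual = nn.Linear(hidden_dim, visual_dim)

    def forward(self, text_feats, visual_feats):
        code = self.encoder(torch.cat([text_feats, visual_feats], dim=-1))
        return code, self.decode_text(code), self.decode_visual(code)

# Usage: the summed reconstruction loss pushes the shared code toward a
# representation that carries both linguistic and visual information.
model = BimodalAutoencoder()
text = torch.randn(8, 300)     # e.g. word embeddings for a batch of words
visual = torch.randn(8, 512)   # e.g. CNN features for the associated images
code, text_hat, visual_hat = model(text, visual)
loss = nn.functional.mse_loss(text_hat, text) + nn.functional.mse_loss(visual_hat, visual)
```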
Order embeddings and character-level convolutions for multimodal alignment
With the rapid recent advances in deep neural networks, several challenging image-based tasks have been approached by researchers in pattern recognition and computer vision. In this paper, we address one of these tasks, which is to match image content with natural language descriptions, sometimes referred to as multimodal content retrieval. Such a task is particularly challe...
Training and Evaluating Multimodal Word Embeddings with Large-scale Web Annotated Images
In this paper, we focus on training and evaluating effective word embeddings with both text and visual information. More specifically, we introduce a large-scale dataset with 300 million sentences describing over 40 million images crawled and downloaded from publicly available Pins (i.e. an image with sentence descriptions uploaded by users) on Pinterest [2]. This dataset is more than 200 times...
Multi-Modal Representations for Improved Bilingual Lexicon Learning
Recent work has revealed the potential of using visual representations for bilingual lexicon learning (BLL). Such image-based BLL methods, however, still fall short of linguistic approaches. In this paper, we propose a simple yet effective multimodal approach that learns bilingual semantic representations that fuse linguistic and visual input. These new bilingual multi-modal embeddings display ...
Evaluating Multimodal Representations on Sentence Similarity: vSTS, Visual Semantic Textual Similarity Dataset
The success of word representations (embeddings) learned from text has motivated analogous methods to learn representations of longer sequences of text, such as sentences, a fundamental step in any task requiring some level of text understanding [13]. Sentence representation is a challenging task that has to consider aspects such as compositionality, phrase similarity, negation, etc. In order to...
Journal: CoRR
Volume: abs/1411.2539
Pages: -
Publication year: 2014